An explanation for the acquisition of word-object mappings is associative learning in a
cross-situational scenario. Here we present analytical results on the performance of a simple associative
learning algorithm for acquiring a one-to-one mapping between $N$ objects and $N$ words based solely
on the co-occurrence between objects and words. In particular, a learning trial in our learning
scenario consists of the presentation of $C + 1 < N$ objects together with a target word, which refers
to one of the objects in the context. We find that the learning times are distributed exponentially and
the learning rates are given by $\ln\left[\frac{N(N-1)}{C + (N-1)^2}\right]$ in the case the $N$ target words are sampled randomly
and by $\frac{1}{N}\ln\left[\frac{N-1}{C}\right]$ in the case they follow a deterministic presentation sequence. This learning
performance is far superior to that exhibited by humans and by more realistic learning algorithms
in cross-situational experiments. We show that introducing discrimination limitations using
Weber's law and forgetting reduces the performance of the associative algorithm to the human level.



I. INTRODUCTION 

Early word-learning or lexicon acquisition by children, 
in which the child learns a fixed and coherent lexicon 
from language-proficient adults, is still a contentious problem
in developmental psychology [1]. The classical associationist
viewpoint, which can be traced back to empiricist
philosophers such as Hume and Locke, contends that
the mechanism of word learning is sensitivity to covari- 
ation - if two events occur at the same time, they be- 
come associated - being part of humans' domain-general 
learning capability. An alternative viewpoint, dubbed 
social-pragmatic theory, claims that the child makes the 
connections between words and their referents by under- 
standing the referential intentions of others. This idea, 
which seems to be originally due to Augustine, implies 
that children use their intuitive psychology or theory of 
mind [5] to read the adults' minds. Although a variety of 
experiments with infants demonstrate that they exhibit 
a remarkable statistical learning capacity [3], the findings
that children generate word-object mappings both quickly
and without errors are difficult to account for by
any form of statistical learning. We refer the reader to
the book by Bloom [1] for a review of this most controversial
and fascinating theme.

Regardless of the mechanisms children use to learn a 
lexicon, the issue of how good humans are at acquiring 
a new lexicon using statistical learning in controlled experiments
has been tackled recently [U-JS]. In addition, it
has been conjectured that statistical learning may be the
principal mechanism in the development of pidgin [10].
In this context (pidgin), however, it is necessary to assume
that the agents are endowed with some capacity to
grasp the intentions of others as well as to understand
nonlinguistic cues; otherwise one cannot circumvent the
referential uncertainty inherent in a word-object mapping.

The statistical learning scenario we consider here is 
termed cross-situational or observational learning, and it 
is based on the intuitive idea that one way a learner
can determine the meaning of a word is to find something
in common across all observed uses of that word
[12-14]. Hence learning takes place through the statistical
sampling of the contexts in which a word appears.
There are two competing theories about the word-learning
mechanism within the cross-situational scenario, namely,
hypothesis testing and associative learning (see [9] for a
review). The former mechanism assumes that the learner
builds coherent hypotheses about the meaning of a word,
which are then confirmed or disconfirmed by evidence [15-18],
whereas the latter is based essentially on counting
the co-occurrences of words and objects [19, 20].
Although associative learning can be made much more sophisticated
than the mere counting of contingencies [5], in
this contribution we focus on the simplest interpretation
of that learning mechanism, which allows the derivation
of explicit mathematical expressions to characterize the
learner's performance.

Although cross-situational associative learning has
been a very popular lexicon acquisition scenario, since it
can be easily implemented and studied through numerical
simulations (see, e.g., [TOl |2"TH2"5]), there have been only
a few attempts to study this learning strategy analytically
[24, 25]. These works considered a minimal model of
cross-situational learning, in which the one-to-one mapping
between $N$ objects and $N$ words must be inferred
through the repeated presentation of $C + 1 < N$ objects
(the context) together with a target word, which
refers to one of the objects in the context. The
co-occurrences between objects and words are stored in a
confidence matrix, whose integer entries count how many
times an object has co-occurred with a given word during
the learning process. The meaning of a particular word
is then obtained by picking the object corresponding to
the greatest confidence value associated to that word, i.e.,
the object that has co-occurred most frequently with that
word. In this paper, we expand on the work of Smith et
al. [24] and offer analytical expressions for the learning
rates of this minimal associative algorithm for different
word sampling schemes; see Eqs. (9), (14) and (17).






To assess the relevance of our findings to the efforts to
understand how humans perform on cross-situational
learning tasks, we use Monte Carlo simulations to compare
the performance of the minimal associative algorithm
with the performance of humans for short learning
times [6] and with the performance of a more elaborate
learning algorithm for long times [7]. Our finding
that the accuracy of the minimal associative algorithm
is much higher than that observed in the experiments is
attributed to the unlimited storage and discrimination
capability of the algorithm. In fact, introduction of errors
in the discrimination of confidence values according to
Weber's law reduces the performance to a level below
that of humans. Somewhat surprisingly, introduction of
forgetting acts synergistically with our prescription for
Weber's law, resulting in an increase of performance that
eventually matches the experimental results.

The rest of this paper is organized as follows. In Sect.
II we describe the learning scenario and in Sect. III we
introduce and study analytically the simplest associative
learning scheme for counting co-occurrences of words and
objects, in which the words are learned independently.
We consider first the problem of learning a single word
and then investigate the effect of using different word
sampling schemes for learning the complete $N$-word lexicon.
In Sect. IV we compare the performance of the minimal
associative algorithm with the performance exhibited
by adult subjects. To understand the high efficiency of
the algorithm we introduce constraints on its storage and
discrimination capabilities and show how the constraint
parameters can be tuned to describe the experimental
results. Finally, in Sect. V we discuss our findings and
present some concluding remarks.



II. CROSS-SITUATIONAL LEARNING 
SCENARIO 

We assume that there are $N$ objects, $N$ words and
a one-to-one mapping between words and objects. To
describe the one-to-one word-object mapping, we use the
index $i = 1, \ldots, N$ to represent the $N$ distinct objects
and the index $h = 1, \ldots, N$ to represent the $N$ distinct
words. Without loss of generality, we define the correct
mapping as that for which the object represented by $i = 1$
is named by the word represented by $h = 1$, the object
represented by $i = 2$ by the word represented by $h = 2$, and
so on. Henceforth we will refer to the integers $i$ and $h$
as objects and words, respectively, but we should keep
in mind that they are actually labels for those complex
entities.

At each learning event, a target word, say word $h = 1$,
is selected and then $C + 1$ distinct objects are selected
from the list of $N$ objects. This set of $C + 1$ objects
forms a context for the selected word. The correct object
($i = 1$, in this case) must be present in the context. The
learner's task is to guess which of the $C + 1$ objects the
word refers to. This is then an ambiguous word-learning
scenario in which there are multiple object candidates for
any word.
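
As an illustration, a single learning trial of this scenario can be sketched in Python as follows. This is only a minimal sketch under our own assumptions: the function name `sample_trial`, the zero-based labels 0, ..., N - 1 (the text uses 1, ..., N) and the uniform choice of the confounders are illustrative choices, not part of the original formulation.

```python
import random

def sample_trial(n_objects, c, rng, target=None):
    """Return (target_word, context) for one learning trial.

    Words and objects are labeled 0..n_objects-1 and word h names object h.
    The context holds the correct object plus c distinct confounders.
    """
    if not 0 <= c < n_objects:
        raise ValueError("need 0 <= C < N")
    if target is None:
        target = rng.randrange(n_objects)       # random word sampling
    others = [i for i in range(n_objects) if i != target]
    context = rng.sample(others, c) + [target]  # confounders drawn without replacement
    rng.shuffle(context)                        # position carries no information
    return target, context

rng = random.Random(0)
word, context = sample_trial(n_objects=10, c=3, rng=rng)
print(word, sorted(context))
```

Passing an explicit `target` lets the same helper serve a deterministic presentation sequence of the words.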

The parameter $C$ is a measure of the ambiguity (and
so of the difficulty) of the learning task. In particular,
in the case $C = N - 1$ the word-object mapping is unlearnable.
At first sight one could expect that learning
would be trivial for $C = 0$ since there is no ambiguity, but
the learning complexity depends also on the manner in which the
objects are selected to compose the contexts. Typically,
the objects are chosen randomly and without replacement
from the list of $N$ objects (see, e.g., [2"5H2"5]), which
for $C = 0$ results in a learning error (i.e., the fraction of
wrong word-object associations) that decreases exponentially
with learning rate $-\ln(1 - 1/N)$ as the number
of learning trials $t$ increases. This is so because there
is a non-vanishing probability that some words are not
selected in the $t$ trials [25].

In order to avoid testing subjects on the meaning of 
words they never heard, most experimental studies on 
word-learning mechanisms use a deterministic word se- 
lection procedure which guarantees that all words are 
uttered before the testing stage, although some words 
may be spoken more frequently than others [1HZ]. Hence
we consider here, in addition to the random selection procedure,
a deterministic selection procedure which guarantees
that all $N$ words are selected in $t = N$ trials. For
this procedure the case $C = 0$ is trivial and the learning
error becomes zero at $t = N$. However, since encountering
words whose meaning is unknown is not a rare
event in the real world (hence the utility of dictionaries),
a non-uniform Zipfian random selection of words is
likely to be a more realistic sampling scheme for learning
natural word-referent associations (see, e.g., [2"5]).



III. MINIMAL ASSOCIATIVE LEARNING 
ALGORITHM 

Here we consider one of the earliest mathematical
learning models, the linear learning model [26]. The
basic assumption of this model is that learning can be
modeled as a change in the confidence with which the
learner associates the target word to a certain object in
the context. More to the point, this confidence is represented
by a matrix whose non-negative integer entries
$p_{hi}$ yield a value for the confidence with which word $h$
is associated to object $i$. We assume that at the outset
($t = 0$) all confidences are set to zero, i.e., $p_{hi} = 0$ with
$i, h = 1, \ldots, N$, and whenever object $i^*$ appears in a context
together with target word $h^*$ the confidence
$p_{h^* i^*}$ increases by one unit. Hence at each learning trial,
$C + 1$ confidences are updated. Note that this learning
algorithm considers reinforcement only.

To determine which object corresponds to word $h$ the
learner simply chooses the object index $i$ for which $p_{hi}$
is maximum. In the case of ties, the learner selects one
object at random among those that maximize the confidence
$p_{hi}$. Recalling our definition of the correct word-object
mapping in the previous section, the learning algorithm
achieves a perfect performance when $p_{hh} > p_{hi}$
for all $h$ and $i \neq h$. The learning error $\epsilon$ at a given trial
$t$ is then given by the fraction of wrong word-object associations.
Note that we have $p_{hi} \leq p_{hh}$ with $i \neq h$ since
object $i = h$ must appear in the contexts of all learning
events in which the target word is $h$ (see Sect. II). In
this case, the learning error of any single word, say $h$,
which we denote by $\epsilon_{sw}$, is $n/(n + 1)$, where $n$ is the number
of objects for which $p_{hi} = p_{hh}$ with $i \neq h$.
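
The counting rule and the tie-breaking prescription described above can be sketched as follows. This is a minimal sketch, assuming random sampling of the target words and zero-based labels; the function names `train` and `expected_error` are ours. Instead of drawing one referent at random among the tied maxima, the second function returns the corresponding expected error, which equals one minus the reciprocal of the number of maxima because the correct object is always among them.

```python
import random

def train(n, c, trials, rng):
    """Accumulate co-occurrence counts conf[h][i] over randomly sampled trials."""
    conf = [[0] * n for _ in range(n)]
    for _ in range(trials):
        h = rng.randrange(n)                              # target word
        confounders = rng.sample([i for i in range(n) if i != h], c)
        for i in confounders + [h]:                       # the C + 1 context objects
            conf[h][i] += 1                               # reinforcement only
    return conf

def expected_error(conf):
    """Average over words of the chance that a random pick among the maxima is wrong."""
    n = len(conf)
    total = 0.0
    for h in range(n):
        best = max(conf[h])
        ties = sum(1 for value in conf[h] if value == best)
        total += 1.0 - 1.0 / ties                         # object h is always among the maxima
    return total / n

rng = random.Random(1)
for trials in (0, 20, 200):
    print(trials, round(expected_error(train(10, 2, trials, rng)), 3))
```

Before any trial every row is tied at zero, so the sketch reproduces the initial error (N - 1)/N.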

Interestingly, it can easily be shown that this very sim- 
ple and general learning algorithm is identical to the al- 
gorithm presented in [23] which is based on detecting the 
intersections of context realizations in order to single out 
the set of confounder objects at a given trial t. This 
equivalence has already been noted in the literature [27] 
(see also [5]). The minimal associative learning algorithm 
can be immediately adapted to incorporate more realis- 
tic features, such as finite memory and imprecision in 
the comparison of magnitudes, whereas the confounder 
reducing algorithm is restricted to an ideal learning sce- 
nario. 

A salient feature of the minimal associative learning 
algorithm which allows the analytical study of its per- 
formance is the fact that words are learned indepen- 
dently. This is easily seen by noting that the confidences 
$p_{hi}$, $i = 1, \ldots, N$ are updated only when the target word
$h$ is selected. This means that, aside from a trivial rescaling
of the learning time, our scenario is equivalent to
the experimental settings (see Sect. IV) in which $C + 1$
target words are presented together with a context exhibiting
$C + 1$ objects, with each object associated to
one of the target words [3HZ]. Taking advantage of this
feature, we will first solve a simplified version of the
cross-situational learning in which a given target word $h$ (and
its associated object $i = h$) appears in all learning trials
whereas the $C$ other objects (the confounders) that
make up the rest of the context vary in each learning trial.
Once the problem of learning a single word is solved (see
Sect. III A), we can easily work out the generalization to
learning the whole lexicon (see Sects. III B and III C). We
will use $\tau$ to measure the time of the learning trials in
the case of single-word learning and $t$ in the whole-lexicon
learning case.



A. Learning a single word 

Before any learning event has taken place, the target
word may be associated to any one of the $N$ objects, so
the initial value of the learning error is always equal to
$(N - 1)/N$. When the first learning event takes place,
the target word may be incorrectly assigned to the $C$
confounder objects shown in the context, so the
probability of error at the first trial is always equal to
$C/(C + 1)$. In the second trial, there are two possibilities:
the probability of error is unchanged because the
same context is chosen, or the probability of error decreases
to the value $n/(n + 1)$ with $n < C$ because only $n$
confounder objects of the first context appeared again in
the second trial. The same reasoning allows us to describe
the probability of error in any trial given that this
probability is known in the previous trial, as described
next.

As pointed out, the possible error values are $n/(n + 1)$
with $n = 0, 1, \ldots, C$. Labeling these values by the index
$n$, the probability of error at trial $\tau$ can be written as

$$ W(\tau) = \left( W_C(\tau), W_{C-1}(\tau), \ldots, W_1(\tau), W_0(\tau) \right). \tag{1} $$

The time evolution of $W(\tau)$ is given by the Markov chain

$$ W(\tau + 1) = W(\tau)\, T, \tag{2} $$

where $T$ is a $(C + 1) \times (C + 1)$ transition matrix whose
entries $T_{mn}$ yield the probability that the error at
a certain trial is $n/(n + 1)$ given that the error was
$m/(m + 1)$ in the previous trial. Clearly, $T_{mn} = 0$ for
$m < n$ since the error cannot increase during the learning
stage in the absence of noise.

It is a simple matter to derive $T_{mn}$ for $m \geq n$ [24]. In
fact, it is given by the probability that in $C$ choices one
selects exactly $n$ of the $m$ confounder objects from the
list of $N - 1$ objects. (We recall that the object associated
to the target word is picked with certainty and so the list
comprises $N - 1$ objects, rather than $N$, and the number
of selections is $C$ rather than $C + 1$.) This is given by
the hypergeometric distribution [28]

$$ T_{mn} = \binom{m}{n} \binom{N - 1 - m}{C - n} \Big/ \binom{N - 1}{C} \tag{3} $$

for $m \geq n$ and $T_{mn} = 0$ for $m < n$. Since the
transition matrix is triangular, its eigenvalues $\lambda_n$ with
$n = 0, 1, \ldots, C$ are the elements of the main diagonal, which
correspond to transitions that leave the learning error
unchanged, i.e.,

$$ \lambda_n = T_{nn} = \binom{N - 1 - n}{C - n} \Big/ \binom{N - 1}{C}. \tag{4} $$

Note that $\lambda_0 = 1 > \lambda_{n \neq 0} > 0$, as expected for eigenvalues
of a transition matrix. In addition, since $\lambda_n / \lambda_{n+1} =
(N - 1 - n)/(C - n) > 1$ the eigenvalues are ordered
such that $\lambda_0 > \lambda_1 > \ldots > \lambda_C$.
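
The Markov chain just described lends itself to a direct numerical check. The sketch below is our own illustration rather than the procedure used to produce the results of this paper: it builds the transition matrix from the hypergeometric probabilities, indexes the states by n = 0, ..., C in increasing order (the opposite of the ordering of the vector W above), and iterates the chain from the state in which all C confounders of the first trial are tied. The function names are hypothetical.

```python
from math import comb

def transition_matrix(n_objects, c):
    """Return T with T[m][n] = P(n tied confounders now | m in the previous trial)."""
    total = comb(n_objects - 1, c)
    # comb(m, n) is zero for n > m, so the matrix is triangular automatically.
    return [[comb(m, n) * comb(n_objects - 1 - m, c - n) / total
             for n in range(c + 1)]
            for m in range(c + 1)]

def single_word_error(n_objects, c, tau):
    """Expected error after tau presentations of one target word."""
    if tau == 0:
        return 1.0 - 1.0 / n_objects          # no evidence yet: uniform guess
    t_mat = transition_matrix(n_objects, c)
    w = [0.0] * c + [1.0]                     # at tau = 1 all C confounders are tied
    for _ in range(tau - 1):
        w = [sum(w[m] * t_mat[m][n] for m in range(c + 1)) for n in range(c + 1)]
    return sum(w[n] * n / (n + 1) for n in range(c + 1))

T = transition_matrix(20, 2)
print([round(T[n][n], 4) for n in range(3)])  # diagonal entries, i.e. the eigenvalues
print([round(single_word_error(20, 2, tau), 6) for tau in range(5)])
```

Each row of the matrix sums to one by Vandermonde's identity, which is a convenient sanity check on the hypergeometric form.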

Recalling that the probability vector is known at $\tau = 1$,
namely, $W(1) = (1, 0, \ldots, 0)$, we can write

$$ W(\tau) = W(1)\, T^{\tau - 1}. \tag{5} $$

Although it is a simple matter to write $T^{\tau - 1}$ in terms of
the right and left eigenvectors of $T$, this procedure does
not produce an explicit analytical expression for $W_n(\tau)$
in terms of the two parameters of the model $C$ and $N$,
since we are not able to find analytical expressions for the
eigenvectors. However, Smith et al. [24] have succeeded
in deriving a closed analytical expression for $W_n(\tau)$ using
the inclusion-exclusion principle of combinatorics [29],

$$ W_n(\tau) = \binom{C}{n} \sum_{i=n}^{C} (-1)^{i-n} \binom{C - n}{i - n} \lambda_i^{\tau - 1}, \tag{6} $$

where $\lambda_i$, given by Eq. (4), is the probability that a
particular set of $i$ members of the $C$ confounders in the
first learning episode $\tau = 1$ appear in any subsequent
episode. Although the spectral decomposition of $T$ plays
no role in the derivation of Eq. (6), we choose to maintain
the notation $\lambda_i$ for the above-mentioned probability.

Recalling that a situation described by $n$ corresponds
to the learning error $n/(n + 1)$ we can immediately write
the average learning error for a single word as

$$ \epsilon_{sw}(\tau) = \sum_{n=0}^{C} \frac{n}{n + 1}\, W_n(\tau), \tag{7} $$

which is valid for $\tau > 0$ only. For $\tau = 0$ one has
$\epsilon_{sw}(0) = 1 - 1/N$. The dependence of $\epsilon_{sw}$ on the number
of learning trials $\tau$ for different values of $N$ and $C$ is
illustrated in Fig. 1 using a semi-logarithmic scale. Except
for very small $\tau$, the learning error exhibits a neat
exponential decay which is revealed by considering only
the leading non-vanishing contribution to $W_n$ for large $\tau$,
namely,

$$ \epsilon_{sw}(\tau) \approx \frac{C}{2}\, \lambda_1^{\tau - 1} = \frac{C}{2} \exp\left[ -(\tau - 1) \ln\frac{N - 1}{C} \right]. \tag{8} $$

Hence the learning rate for single-word learning is

$$ \alpha_{sw} = \ln\left[ (N - 1)/C \right], \tag{9} $$

which is zero in the case $C = N - 1$, i.e., all objects appear
in the context and so learning is impossible. In the
case $C = 0$, the learning rate diverges so that $\epsilon_{sw} = 0$
at the first learning trial $\tau = 1$ already. Most interestingly,
the learning rate increases with increasing $N$ (see
Fig. 1), indicating that the larger the number of objects,
the faster the learning of a single word. This apparently
counterintuitive result has a simple explanation: a
large list of objects to select from actually decreases the
chances of choosing the same confounding object during
the learning events.

FIG. 1: (Color online) The expected single-word learning error
$\epsilon_{sw}$ as a function of the number of learning trials $\tau$. The
solid curves are the results of Eq. (7) and the filled circles the
results of Monte Carlo simulations. The upper panel shows
the results for $C = 2$ and (left to right) $N = 100$, 50, 30 and
20, and the lower panel for $N = 20$ and (left to right) $C = 5$,
10, 13, 15 and 16.



B. Learning the whole lexicon with random
sampling



We turn now to the original learning problem in which
the learner has to acquire the one-to-one mapping between
the $N$ words and the $N$ objects. In this section
we focus on the case in which the target word at each learning
trial is chosen randomly from the list of $N$ words. Since
all words have the same probability of being chosen, the
probability of choosing a particular word is $1/N$.

At trial $t$ we assume that word 1 appeared $k_1$ times,
word 2 appeared $k_2$ times, and so on with $k_1 + k_2 + \cdots +
k_N = t$. The integers $k_i = 0, \ldots, t$ are random variables
distributed by the multinomial

$$ P(k_1, \ldots, k_N) = \frac{t!}{k_1! \cdots k_N!}\, \frac{1}{N^t}\, \delta_{t,\, k_1 + \cdots + k_N}. \tag{10} $$

Clearly, if word $i$ appeared $k_i$ times in the course of $t$
trials then the expected error associated to it is $\epsilon_{sw}(k_i)$
with the (word independent) single word error given by
Eq. (7) for $k_i > 0$. With this observation in mind, we
can immediately write the expected learning error in the
case the $N$ words are sampled randomly,

$$ E_r(t) = \sum_{k=0}^{t} \binom{t}{k} \left( \frac{1}{N} \right)^{k} \left( 1 - \frac{1}{N} \right)^{t-k} \epsilon_{sw}(k). \tag{11} $$

The sum over $k$ can be easily carried out provided we
take into account the fact that $\epsilon_{sw}(k)$ has different prescriptions
for the cases $k = 0$ and $k > 0$. We find

$$ E_r(t) = \left( 1 - \frac{1}{N} \right)^{t+1} + \sum_{n=0}^{C} \frac{n}{n+1} \binom{C}{n} \sum_{i=n}^{C} (-1)^{i-n} \binom{C-n}{i-n} \frac{1}{\lambda_i} \left[ \left( 1 - \frac{1 - \lambda_i}{N} \right)^{t} - \left( 1 - \frac{1}{N} \right)^{t} \right] \tag{12} $$

with $\lambda_i$ given by Eq. (4). This is a formidable expression
which can be evaluated numerically for $C$ not too large
and in Fig. 2 we exhibit the dependence of $E_r$ on the
number of learning trials for a selection of values of $N$
and $C$.

FIG. 2: (Color online) The expected learning error $E_r$ in the
case the $N$ words are sampled randomly as a function of the
number of learning trials $t$. The solid curves are the results
of Eq. (12) and the filled circles the results of Monte Carlo
simulations. The upper panel shows the results for $C = 2$
and (left to right) $N = 10, 20, \ldots, 80$ and the lower panel the
results for $N = 20$ and (left to right) $C = 1, 2, \ldots, 10$.

To obtain the asymptotic time dependence of $E_r$ we
need to keep in the double sum only the leading order
term. Since the summand in Eq. (12) vanishes for $n = 0$,
the largest eigenvalue that appears in that expression is
$\lambda_1$, corresponding to the term $i = n = 1$, and so this is
the term that dominates the sum in the limit $t \to \infty$.
Hence $E_r$ exhibits the exponential decay

$$ E_r \sim \exp(-\alpha_r t), \tag{13} $$

where

$$ \alpha_r = \ln\left[ \frac{N(N - 1)}{C + (N - 1)^2} \right] \tag{14} $$

is the learning rate of our algorithm in the case the $N$
words are sampled randomly. As already mentioned, it is
interesting that the unambiguous learning scenario $C = 0$
results in the finite learning rate $-\ln(1 - 1/N)$ simply
because some words may never be chosen in the course
of the $t$ learning trials. Interestingly, the learning rate
$\alpha_r$ exhibits a non-monotonic dependence on $N$ for fixed
$C$: for $N > 2C + 1$, it decreases with increasing $N$ (this
is the parameter selection used to draw the upper panel
of Fig. 2), and it increases with increasing $N$ otherwise.
Recalling that for fixed $C$ the minimum value of $N$ is
$N = C + 1$, at which $\alpha_r = 0$, increasing $N$ from this
minimal value must result in an increase of $\alpha_r$. The fact
that $\alpha_r$ decreases for large $N$, an effect of sampling,
implies that there is an optimal value $N^* = 2C + 1$ that
maximizes the learning speed for fixed $C$. Of course, for
fixed $N$ the learning speed is maximized by $C = 0$.
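
The asymptotic decay under random sampling can be checked numerically. The sketch below rests on assumptions that we state explicitly: the number of times a given word is sampled in t trials is binomial with parameters t and 1/N (the marginal of the multinomial), the single-word error is obtained by iterating the hypergeometric Markov chain of Sect. III A, and the rate is taken to be -ln[1 - 1/N + C/(N(N - 1))], which reduces to -ln(1 - 1/N) for C = 0. The function names are ours.

```python
from math import comb, log

def single_word_errors(n_objects, c, max_tau):
    """List e[k] of expected single-word errors after k = 0..max_tau exposures."""
    total = comb(n_objects - 1, c)
    t_mat = [[comb(m, n) * comb(n_objects - 1 - m, c - n) / total
              for n in range(c + 1)] for m in range(c + 1)]
    errors = [1.0 - 1.0 / n_objects]          # k = 0: the word was never heard
    w = [0.0] * c + [1.0]                     # after one exposure all C confounders tie
    for _ in range(max_tau):
        errors.append(sum(w[n] * n / (n + 1) for n in range(c + 1)))
        w = [sum(w[m] * t_mat[m][n] for m in range(c + 1)) for n in range(c + 1)]
    return errors

def lexicon_error_random(n_objects, c, t):
    """Expected lexicon error after t trials with uniformly sampled target words."""
    e = single_word_errors(n_objects, c, t)
    p = 1.0 / n_objects
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) * e[k] for k in range(t + 1))

def rate_random(n_objects, c):
    """Assumed asymptotic decay rate for random word sampling."""
    return -log(1.0 - 1.0 / n_objects + c / (n_objects * (n_objects - 1)))

estimate = log(lexicon_error_random(5, 1, 200) / lexicon_error_random(5, 1, 201))
print(estimate, rate_random(5, 1))
```

The estimate is the logarithmic slope of the error between two consecutive late trials, so it should approach the assumed rate as t grows.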



C. Learning the whole lexicon with deterministic 
sampling 

To better understand the effects of the random sampling
of the $N$ words we consider here a deterministic
sampling scheme in which every word is guaranteed to
be chosen in the course of $N$ learning trials. Let us begin
with the first $N$ learning trials and recall that at time
$t = 0$ all words have error $\epsilon_{sw}(0) = (N - 1)/N$. Then
during the learning process for $t = 1, \ldots, N$ there will
be $t$ words with error $\epsilon_{sw}(1) = C/(C + 1)$ and $N - t$
with error $\epsilon_{sw}(0)$ so that the total learning error for the
deterministic sampling is

$$ E_d(t) = \frac{1}{N} \left[ t\, \epsilon_{sw}(1) + (N - t)\, \epsilon_{sw}(0) \right], \qquad t \leq N. \tag{15} $$

FIG. 3: (Color online) The expected learning error $E_d$ for the
case the $N$ words are sampled deterministically as a function
of the number of learning trials $t$. The solid curves are the
results of Eq. (16) and the filled circles the results of Monte
Carlo simulations. The upper panel shows the results for $C = 2$
and (left to right) $N = 10, 20, \ldots, 100$ and the lower panel
the results for $N = 20$ and (left to right) $C = 1, 2, \ldots, 10$.



This expression can be easily extended for general $t$ by
introducing the single-word learning time $\tau = \lfloor t/N \rfloor$,

$$ E_d(t) = \frac{1}{N} \left[ (t - N\tau)\, \epsilon_{sw}(\tau + 1) + (N\tau + N - t)\, \epsilon_{sw}(\tau) \right], \tag{16} $$

where $\lfloor x \rfloor$ is the largest integer not greater than $x$. The
time dependence of the learning error for the deterministic
sampling of the $N$ words is shown in Fig. 3. For
$t \gg N$, $\tau$ becomes a continuous variable for any practical
purpose, and then we can see that $E_d$ decreases exponentially
with increasing $t$. Clearly, the learning rate
is determined by the single-word learning error [see Eq.
(8)] and so replacing $\tau$ by $t/N$ in that equation we obtain
the learning rate for the deterministic sampling case

$$ \alpha_d(C, N) = \frac{1}{N} \ln\left[ \frac{N - 1}{C} \right]. \tag{17} $$



As in the single-word learning case, the learning rate diverges
for $C = 0$, in accordance with our intuition that
in the absence of ambiguity the learning task should be
completed in $N$ steps. In fact, the learning error decreases
linearly with $t$ as given by Eq. (15). Similarly to
our findings for the random sampling, $\alpha_d$ exhibits a
non-monotonic dependence on $N$: beginning from $\alpha_d = 0$ at
$N = C + 1$, it increases until reaching a maximum at
$N^* \approx eC$ and then decreases towards zero again as the
size of the lexicon further increases.

It is interesting to compare the learning rates for the
two sampling schemes, Eqs. (14) and (17). In the leading
non-vanishing order for large $N$ and $C \ll N$, we find
$\alpha_r \approx 1/N - C/N^2$ whereas $\alpha_d \approx (\ln N)/N$. In the more realistic
situation in which the context size grows linearly with
the lexicon size, i.e., $C = \gamma N$ with $\gamma \in [0, 1]$, for large $N$
we find $\alpha_r \approx (1 - \gamma)/N$ and $\alpha_d \approx -(\ln \gamma)/N$. Hence for
small $C$ or $\gamma \approx 0$, the deterministic sampling of words results
in much faster learning than the random sampling.
For large $C$ or $\gamma \approx 1$, however, the two sampling schemes
produce equivalent results.
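
The comparison between the two sampling schemes is easy to explore numerically. The following sketch takes the two rates to be ln[N(N - 1)/(C + (N - 1)^2)] for random sampling and (1/N) ln[(N - 1)/C] for deterministic sampling, which is our reading of Eqs. (14) and (17), and locates by brute force the lexicon size that maximizes each rate for a fixed context size C >= 1. The function names and the search bound `n_max` are illustrative assumptions.

```python
from math import e, log

def rate_random(n, c):
    """Learning rate for random word sampling, Eq. (14) as we read it."""
    return log(n * (n - 1) / (c + (n - 1) ** 2))

def rate_deterministic(n, c):
    """Learning rate for deterministic word sampling, Eq. (17) as we read it (c >= 1)."""
    return log((n - 1) / c) / n

def best_lexicon_size(rate, c, n_max=2000):
    """Lexicon size N in [c + 1, n_max] that maximizes the given rate for fixed c."""
    return max(range(c + 1, n_max + 1), key=lambda n: rate(n, c))

c = 50
print(best_lexicon_size(rate_random, c), 2 * c + 1)
print(best_lexicon_size(rate_deterministic, c), round(e * c, 1))
print(rate_random(100, 2), rate_deterministic(100, 2))    # small C
print(rate_random(100, 98), rate_deterministic(100, 98))  # C close to N
```

For small C the deterministic rate should exceed the random one by a wide margin, while for C close to N the two should nearly coincide.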



IV. EFFECTS OF IMPERFECT MEMORY AND 
DISCRIMINABILITY

The simplicity of the minimal associative learning algorithm
analyzed in the previous section is deceptive. In
fact, the algorithm contains two assumptions that make
it extremely powerful. The first assumption is unlimited
memory, since the algorithm stores the confidence values
from the very first to the last learning episode, regardless
of the number of learning episodes. The second is perfect
discriminability, since it always identifies the largest
confidence regardless of the closeness to, say, the
second-largest one.

The scheme we use to relax the perfect discriminability 
assumption is inspired by Weber's law, which asserts that 
the discriminability of two perceived magnitudes is determined
by the ratio of the objective magnitudes. Accordingly,
we assume that the probability that the algorithm
selects object $i$ as the referent of any given word $h$ is simply
$p_{hi} / \sum_j p_{hj}$, so that referents with similar confidence
values have similar probabilities of being selected. This
differs from the original minimal algorithm for which the 
referent selection probability is either one or zero, except 
in the case of ties when the probability is divided equally 
among the referents with identical confidence values. 

Forgetting or decaying of the confidence values is implemented
by subtracting a fixed factor $\beta \in [0, 1]$ from
the confidences $p_{hi}$, $i = 1, \ldots, N$ whenever word $h$ is absent
from a learning episode. The problem with this procedure
is that the confidence values may become negative,
and when this happens we reset them to zero. Another
difficulty that may arise is when $p_{hi} = 0$ for all
$i = 1, \ldots, N$, and in this case we reset $p_{hi} = 1/N$ for all
$i = 1, \ldots, N$. These resetting procedures are responsible
for the discontinuities observed in the performance
of the algorithm as we will see next. As in the minimal
algorithm, we add 1 to the confidences associated to the
target word and the objects exhibited in the context.

Relaxation of the perfect memory assumption makes
the forgetting parameter $\beta$ dependent on the sampling
scheme of words, which precludes an analytical approach
to this problem. As we have to resort to simulations to
study the performance of the modified algorithm anyway,
in this section we consider a very specific sampling
scheme used in experiments with adult subjects to test
the effect of varying the frequency of presentation of the
target words on their learning performances [6]. More
importantly, use of this sampling scheme allows us to
compare quantitatively the performance of the minimal
as well as of the modified associative learning algorithms
with the performances of the adult subjects.
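
The modified update and selection rules described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the simulation code behind the results reported here: confidences are stored as floats in `conf[h][i]` with zero-based labels, every word shown in an episode is reinforced with every object shown in that episode, and a row that sums to zero at selection time is treated as a uniform guess. The function names are hypothetical.

```python
import random

def update(conf, words, objects, beta):
    """One learning episode: reinforce present pairs, decay rows of absent words."""
    n = len(conf)
    present = set(words)
    for h in range(n):
        if h in present:
            for i in objects:
                conf[h][i] += 1.0                          # co-occurrence count
        else:
            row = [max(p - beta, 0.0) for p in conf[h]]    # forgetting, clipped at zero
            if not any(row):
                row = [1.0 / n] * n                        # reset when the whole row vanished
            conf[h] = row

def selection_probabilities(row):
    """Weber-like rule: referent i is chosen with probability row[i] / sum(row)."""
    total = sum(row)
    if total == 0.0:
        return [1.0 / len(row)] * len(row)                 # assumed fallback: uniform guess
    return [p / total for p in row]

def choose_referent(row, rng):
    """Draw one referent index according to the selection probabilities."""
    return rng.choices(range(len(row)), weights=selection_probabilities(row), k=1)[0]

conf = [[0.0] * 3 for _ in range(3)]
update(conf, words=[0], objects=[0, 1], beta=0.2)
update(conf, words=[1], objects=[1, 2], beta=0.2)
print(conf[0], selection_probabilities(conf[1]))
print(choose_referent(conf[1], random.Random(0)))
```

The expected accuracy for word h is then simply the h-th entry of `selection_probabilities(conf[h])`, so no sampling of referents is needed to estimate it.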

The experiment we consider here aims at evaluating
the performance of the associative algorithms in learning
a mapping between $N = 18$ words and $N = 18$ objects
after 27 training episodes [6]. Each episode comprises
the presentation of 4 objects together with their corresponding
words. Following Ref. [6], we investigate two
conditions. In the two frequency condition, the 18 words
are divided into two subsets of 9 words each. In the first
subset the 9 words appear 9 times and in the second only
3 times (see Fig. 4). In the three frequency condition,
the 18 words are divided into three subsets of 6 words each.
In the first subset, the 6 words appear 3 times, in the second,
6 times and in the third, 9 times (see Fig. 5). In
these two conditions, the same word was not allowed to
appear in two consecutive learning episodes.

Once the cross-situational learning scenario is defined,
we carry out $10^4$ runs of the modified associative learning
algorithm for a fixed value of the forgetting parameter.
The results are shown in terms of the average accuracy
$1 - \langle \epsilon \rangle$ as a function of $\beta$ in Figs. 4 and 5. The horizontal
straight lines and the shaded zones around them represent
the means and standard deviations of the results of
experiments carried out with 33 adult subjects [6].

Before discussing the interesting dependence of the accuracy
on the forgetting parameter exhibited in Figs. 4
and 5, a word is in order about the performance of the
original minimal algorithm that is not shown in those
figures. In the two frequency condition, the mean accuracy
is 0.99 for words in the 9-repetition subset and
0.90 for those in the 3-repetition subset. In the three frequency
condition, the mean accuracy is 0.99 for words
in the 9- and 6-repetition subsets, and 0.91 for those
in the 3-repetition subset. These accuracy values are
well above those exhibited in Figs. 4 and 5. Moreover,
adding the forgetting factor to the minimal associative
algorithm does not affect its performance, since subtracting
the same quantity from all confidence values $p_{hi}$ for
a fixed word $h$ does not alter the rank order of these
confidences.

Although we intuitively expect that words that appear
more frequently would be learned better, this outcome
actually depends on the value of the forgetting parameter,
as shown in Figs. 4 and 5. This counterintuitive
finding was first observed in the three frequency condition
experiment on adult subjects [6]. In fact, the results
of those experiments (i.e., the expected accuracies) can
be described very well by choosing $\beta = 0.16$ in the two
frequency condition and $\beta = 0.08$ in the three frequency
condition.

FIG. 4: (Color online) Expected accuracy for the two frequency
condition as a function of the forgetting parameter $\beta$ at
learning trial $t = 27$. The curves show the accuracy of the
set of words sampled 9 and 3 times as indicated in the figure.
The horizontal lines and the shaded zones are the experimental
results [6]. For $\beta \approx 0.16$ we get an excellent agreement
between the model and experiments.

FIG. 5: (Color online) Expected accuracy for the three frequency
condition as a function of the forgetting parameter $\beta$ at
learning trial $t = 27$. The curves show the accuracy of the set
of words sampled 9, 6 and 3 times as indicated in the figure.
The horizontal lines and the shaded zones are the experimental
results [6]. For $\beta \approx 0.08$ we get an excellent agreement
between the model and experiments.

It is interesting that the choice of a moderate value for
the forgetting parameter $\beta$ may result in a considerable
improvement of the performance of the algorithm. This
is a direct consequence of Weber's law prescription for
the discrimination of the confidence values, and so there
is a synergy between discrimination and memory in our
algorithm. To see this we note that at a given learning
trial the ratio between the probabilities of selecting
referent i = 1 and referent i = 2 for a word h is
$r = p_{h1}/p_{h2}$. If word h does not appear in the next
trial then this ratio becomes

$$ r' = \frac{p_{h1} - \beta}{p_{h2} - \beta} \approx r \left[ 1 + \frac{\beta}{p_{h1} p_{h2}} \left( p_{h1} - p_{h2} \right) \right] \qquad (18) $$

so that $r' > r$ if $p_{h1} > p_{h2}$, thus implying that the
forgetting parameter helps the discrimination of the largest
confidence. Of course, too large values of $\beta$ deteriorate
the performance of the algorithm as shown in the figures.
We note that the dents and jumps in the learning curves
are not statistical fluctuations but consequences of the
discontinuities introduced by the ad hoc regularization
procedures discussed before.
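The sharpening effect expressed by Eq. (18) can be checked numerically. The Python sketch below is only an illustration: the function names and the example confidence values are hypothetical and are not taken from the experiments. It evaluates the ratio after one trial in which word h is absent, and compares it with the leading-order form in $\beta$.

```python
def ratio_after_forgetting(p1: float, p2: float, beta: float) -> float:
    """Ratio (p1 - beta) / (p2 - beta) after subtracting beta from both confidences."""
    if beta >= min(p1, p2):
        raise ValueError("beta must be smaller than both confidence values")
    return (p1 - beta) / (p2 - beta)


def ratio_leading_order(p1: float, p2: float, beta: float) -> float:
    """Leading-order expansion in beta: r * [1 + beta * (p1 - p2) / (p1 * p2)]."""
    r = p1 / p2
    return r * (1.0 + beta * (p1 - p2) / (p1 * p2))


if __name__ == "__main__":
    p1, p2 = 0.6, 0.4  # hypothetical confidences, with p1 the larger one
    for beta in (0.0, 0.01, 0.05):
        exact = ratio_after_forgetting(p1, p2, beta)
        approx = ratio_leading_order(p1, p2, beta)
        print(f"beta={beta:.2f}  exact r'={exact:.4f}  leading order={approx:.4f}")
```

In this sketch the ratio grows with $\beta$ when the first confidence is the larger one, and shrinks when it is the smaller one, which is the sense in which forgetting helps single out the largest confidence.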

The above analysis, summarized in part by Figs. 4
and 5, evinces the better performance of the associative
algorithm with perfect storage and discrimination
capabilities when compared with humans' performance for a
finite number of learning trials (t = 27 in this case). In
addition, it shows that introduction of imprecision in the
discrimination of confidence values following Weber's law
prescription together with forgetting brings that
performance down to the human level.

For the sake of completeness, it would be interesting
to compare the performance of the minimal associative
algorithm with humans' performance in the limit of very
long learning times, which was in fact the main focus of
Sect. III. As there are no such experiments - we guess
it would be nearly impossible to keep the subjects'
attention focused on such boring tasks for too long - next
we compare the performance of the minimal algorithm
with the performance of a rather sophisticated learning
algorithm which, among other things, models the
attention of the learners to regular and novel words [7]. The
algorithm is described briefly as follows. At any given
trial, the confidence values $p_{hi}$ are adjusted according to
the update rule



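$$ p'_{hi} = \hat{\beta} \, p_{hi} + \chi \, \frac{p_{hi} \exp\left[ \lambda \left( H_h + H_i \right) \right]}{\sum_{h,i} p_{hi} \exp\left[ \lambda \left( H_h + H_i \right) \right]} \qquad (19) $$

where

$$ H_h = - \sum_i \Lambda_{hi} \ln \Lambda_{hi} \qquad (20) $$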



with $\Lambda_{hi} = p_{hi} / \sum_i p_{hi}$, and similarly for $H_i$ with the
indexes of the sums running over the set of words [7]. In
this equation the entropies $H_h$ and $H_i$ are used as
measures of the novelty of word h and object i at the current
learning episode. The parameter $\hat{\beta}$ governs forgetting, $\chi$
is the weight distributed among the potential associations
in the trial, and $\lambda$ weights the uncertainty (entropies) and
prior knowledge ($p_{hi}$). We refer the reader to Ref. [7] for
a detailed explanation of the algorithm as well as for a
comparison with experimental results for short learning
times. Here we present its performance in acquiring the
word-object mapping in the simplified scenario of Sect.
III (i.e., one target word and C+1 objects in the context)
for randomly sampled words.

FIG. 6: (Color online) Expected learning error for N = 10 and
C = 2 as function of the number of learning trials t in the case
words are sampled randomly. The open circles are results of
the minimal associative algorithm whereas the filled symbols
are the results of the algorithm proposed by Kachergis et
al. [7]: diamonds ($\chi = 3.01$, $\lambda = 1.39$, $\hat{\beta} = 0.64$), circles
($\chi = 0.31$, $\lambda = 2.34$, $\hat{\beta} = 0.91$), and squares ($\chi = 0.20$,
$\lambda = 0.88$, $\hat{\beta} = 0.96$).
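The update rule of Eqs. (19) and (20) can be sketched in a few lines of Python for the single-target-word scenario. This is a minimal sketch under stated assumptions, not the implementation of Ref. [7]: the names `entropy`, `update`, `chi`, `lam` and `decay` are hypothetical, the forgetting factor is assumed to multiply every entry of the confidence matrix, the normalizing sum is assumed to run over the objects in the current context only, and the confidences are assumed to start from strictly positive values.

```python
import numpy as np


def entropy(weights: np.ndarray) -> float:
    """Shannon entropy of a non-negative vector after normalization."""
    total = weights.sum()
    if total <= 0:
        return 0.0
    q = weights / total
    q = q[q > 0]  # 0 * ln(0) is taken as 0
    return float(-(q * np.log(q)).sum())


def update(p: np.ndarray, word: int, context: list,
           chi: float, lam: float, decay: float) -> np.ndarray:
    """One learning trial; p[h, i] is the confidence of word h for object i."""
    p_new = decay * p  # assumption: forgetting acts on all associations
    h_word = entropy(p[word, :])  # novelty of the target word
    h_obj = np.array([entropy(p[:, i]) for i in context])  # novelty of each object
    w = p[word, context] * np.exp(lam * (h_word + h_obj))
    p_new[word, context] += chi * w / w.sum()  # distribute chi over the context
    return p_new


if __name__ == "__main__":
    n = 4
    p = np.full((n, n), 1.0 / n)  # assumed uniform, strictly positive start
    p = update(p, word=0, context=[0, 1], chi=1.0, lam=1.0, decay=0.9)
    print(p[0])
```

The total weight added in a trial equals `chi`, and it is shared among the context objects in proportion to their current confidence times the entropy factor, so associations that are either already strong or involve novel items receive most of the increment.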

Figure 6 summarizes our findings for N = 10, C = 2
and three selections of the parameter set ($\chi$, $\lambda$, $\hat{\beta}$) used by
Kachergis et al. to reproduce the experimental results
[7]. The symbols in this figure represent an average over
$10^4$ independent samples. The expected learning error
decreases exponentially with increasing t and the rate of
learning (the slope of the learning curves for large t in
the semi-log scale) is roughly insensitive to the choice of
the parameters of the algorithm. As expected from our
previous analysis of short learning times, the minimal
associative learning algorithm performs much better than
the more realistic algorithm. These conclusions hold true
for a wide variety of selections of N and C, as
well as for the deterministic word sampling scheme.



V. DISCUSSION 

As the problem of learning a lexicon within a
cross-situational scenario was studied rather extensively by
Smith et al. [24], it is appropriate that we highlight
our original contributions to the subject in this concluding
section. Although we have borrowed from that work
a key result for the problem of learning a single word,
namely, Eq. (6), even in this case the focal points of our
studies deviate substantially. In fact, throughout the paper
our main goal was the determination of the learning
rates in several learning scenarios, whereas the main
interest of Smith et al. was in quantifying the number of
learning trials required to learn a word with a fixed given
probability [24]. In addition, those authors addressed the
problem of the random sampling of words using various
approximations, leading to inexact results from which
the learning rate $\alpha_r$, see Eq. (14), cannot be recovered.
As a result, the interesting non-monotonic dependence of
$\alpha_r$ (and $\alpha_d$, as well) on the size N of the lexicon passed
unnoticed. The study of the deterministic sampling of
words and the introduction and analysis of the effects
of limited storage and discrimination capabilities on the
original minimal associative algorithm are original
contributions of our paper.

We note that in the cross-situational scenarios studied
previously [23] the set of objects that can be associated
to a given word is word-dependent, rather than
constant as considered here. In other words, if the target
word is h then the elements of the context in a learning
episode are drawn from a fixed subset of $N_h \le N$ objects.
These subsets can freely overlap with each other.
Here we have assumed $N_h = N$ for h = 1, ..., N. Of
course, this generalization does not affect the analysis of
the single-word learning, except that $\epsilon_{sw}$ becomes
word-dependent since the parameter N is replaced by $N_h$
in the single-word learning error, and similarly for the
learning rate $\alpha_{sw}$ [see Eq. (9)]. More importantly, since
words are learned independently by the minimal associative
algorithm, the single-word learning errors contribute
additively to the total lexicon learning error regardless of
the sampling procedure [see Eqs. (11) and (16)]. Hence the
asymptotic behavior of the total error is determined by the
word that takes the longest to be acquired, i.e., the word
with the lowest learning rate or equivalently with the
smallest subset cardinality $N_h$. With this in mind we can
easily obtain the learning rates for this more general
situation, namely,
$\alpha_r = \ln \{ N (N_m - 1) / [ C + (N_m - 1)(N - 1) ] \}$
and $\alpha_d = \ln [ (N_m - 1)/C ] / N$, where
$N_m = \min_h \{ N_h, h = 1, \ldots, N \}$. As expected, in the case
$N_m = N$ these expressions reduce to Eqs. (14) and (17).
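The two learning rates quoted above are easy to evaluate directly. The Python sketch below simply codes those two expressions; the function names and the convention that `n_min` defaults to N are our own choices for illustration. Scanning N at fixed C also makes the non-monotonic dependence of the random-sampling rate on the lexicon size visible.

```python
import math
from typing import Optional


def _check(n: int, c: int, n_min: int) -> None:
    # The rates are positive only if the context is smaller than the smallest subset.
    if not (c + 1 < n_min <= n):
        raise ValueError("require C + 1 < N_m <= N")


def alpha_random(n: int, c: int, n_min: Optional[int] = None) -> float:
    """Learning rate for randomly sampled words; n_min is the smallest subset size."""
    n_min = n if n_min is None else n_min
    _check(n, c, n_min)
    return math.log(n * (n_min - 1) / (c + (n_min - 1) * (n - 1)))


def alpha_deterministic(n: int, c: int, n_min: Optional[int] = None) -> float:
    """Learning rate for the deterministic presentation sequence."""
    n_min = n if n_min is None else n_min
    _check(n, c, n_min)
    return math.log((n_min - 1) / c) / n


if __name__ == "__main__":
    c = 2
    for n in range(4, 11):
        print(n, round(alpha_random(n, c), 4), round(alpha_deterministic(n, c), 4))
```

For C = 2, for instance, the random-sampling rate first increases with N and then decreases, while restricting any word to a smaller subset (`n_min` below N) lowers both rates.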



The cross-situational learning scenario considered here,
as well as those used in experimental studies, does not
account for the presence of external noise, such as the
effect of out-of-context target words. This situation can be
modeled by introducing a probability $\gamma \in [0, 1]$ that the
correct object is not part of the context, so the target word
can be said to be out of context. Since we have assumed
that learning is based on the perception of differences in
the co-occurrence of objects and target words, in the case
all N objects have the same probability of being selected
to form the contexts regardless of the target word, such a
purely observational learning is clearly unattainable. To
determine the critical value of the noise parameter $\gamma_c$ at
which this situation occurs we simply equate the
probability of selecting the correct object with the probability
of selecting any given confounding object to compose the
context in a learning episode,



$$ 1 - \gamma_c = \frac{(1 - \gamma_c) \, C}{N - 1} + \frac{\gamma_c \, (C + 1)}{N - 1} \qquad (21) $$

from where we get

$$ \gamma_c = 1 - \frac{C + 1}{N} . \qquad (22) $$

Since in this case all objects and all words are
equivalent, in the sense they have the same probability of
co-occurrence, the average single-word learning error, as
well as the total error regardless of the sampling scheme,
is simply $\epsilon_{sw} = 1 - 1/N$. We refer the reader to Ref.
[5U] for a detailed study of the behavior of the minimal
associative learning algorithm near the critical noise
parameter using statistical mechanics techniques. Here we
emphasize that the existence of $\gamma_c$ is not dependent on
the algorithm used to learn the word-object mapping.
Rather, it is a limitation of cross-situational learning in
general.

The simplifying feature of our model that allowed an
analytical approach, as well as extremely efficient Monte
Carlo simulations (in all graphs the error bars were
smaller than the symbol sizes), is the fact that words
are learned independently from each other. In this
context, the minimal associative algorithm considered here
corresponds to the optimal learning strategy. Moreover,
the fact that the minimal associative algorithm exhibits
effectively unlimited storage and discrimination
capabilities makes its learning performance much superior to
that of adult subjects in controlled experiments [6] and
to that of sophisticated algorithms designed to capture
the strategies used by humans in the observational
learning task [7]. Interestingly, introduction of errors in the
discrimination of the confidence values using Weber's law
reduced the performance of the minimal algorithm to the
level reported in the experiments. Perhaps sophisticated
learning strategies such as the mutual exclusivity
constraint [15], which directs children to map novel words to
unnamed referents, have evolved to compensate for the
limitations imposed by Weber's law on the evaluation of the
frequency of co-occurrence of words and referents.
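The balance condition that defines the critical noise level can be verified with exact rational arithmetic. The Python sketch below is an illustration under a stated assumption: confounding objects are taken to be drawn uniformly from the other N - 1 objects, so that a given confounder enters the context with probability C/(N - 1) when the correct object is present and (C + 1)/(N - 1) when it is absent. The function names are hypothetical.

```python
from fractions import Fraction


def p_correct(gamma: Fraction) -> Fraction:
    """Probability that the correct object is part of the context."""
    return 1 - gamma


def p_confounder(gamma: Fraction, n: int, c: int) -> Fraction:
    """Probability that a given confounding object is part of the context."""
    return ((1 - gamma) * c + gamma * (c + 1)) / (n - 1)


def gamma_critical(n: int, c: int) -> Fraction:
    """Critical noise level, gamma_c = 1 - (C + 1)/N."""
    return 1 - Fraction(c + 1, n)


if __name__ == "__main__":
    n, c = 10, 2
    g = gamma_critical(n, c)
    print("gamma_c =", g)
    print("P(correct in context)    =", p_correct(g))
    print("P(confounder in context) =", p_confounder(g, n, c))
```

At the critical noise both probabilities coincide with (C + 1)/N, the chance that any of the N objects is among the C + 1 context slots, so co-occurrence statistics carry no information about the correct referent.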